How to (accurately) skip past streams

Authors

  • Supratik Bhattacharyya
  • André Madeira
  • S. Muthukrishnan
  • Tao Ye
Abstract

For processing massive data streams, most proposed algorithmic methods look at each new item, perform a small number of operations while keeping a small amount of memory, and still perform much-needed analyses. However, in many situations, per-item update speed is critical and not every item can be examined extensively. In practice, this has been addressed by sampling only a subset of items (say, 1 in N) from the input, but doing so forfeits guarantees on the accuracy of the post-hoc analyses. In this paper, we present a technique for skipping past streams. Unlike traditional sampling approaches, our skipping is performed in a principled manner without significant loss of guarantees on post-hoc analyses, while substantially improving the processing rate. Using this technique on top of well-known sketches, we show improvements in update time as well as guaranteed accuracy for a number of stream processing problems, including data summarization, heavy-hitter detection, and self-join size estimation. We present experimental results of our methods over synthetic data and integrate our methods into Sprint’s Continuous Monitoring (CMON) system for live network traffic analysis. Furthermore, going beyond traditional packet header analysis, we show how packet contents can be analyzed at streaming speeds, a more challenging task because the content of a single packet can result in many updates.
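The abstract does not spell out the skipping rule, so the following is only a rough illustration of the baseline it contrasts against, not the authors' method. It assumes a Count-Min sketch as the "well-known sketch" (the abstract does not name one) and compares updating on every item with naive 1-in-N sampling, where each sampled item is added with weight N. All class and function names here are hypothetical.

```python
import random


class CountMinSketch:
    """Count-Min sketch: `depth` rows of `width` counters, one hash per row.

    Assumed here as a stand-in for the "well-known sketches" in the abstract.
    """

    PRIME = (1 << 61) - 1  # Mersenne prime for the modular hash family

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        # One (a, b) pair per row: h(x) = ((a*x + b) mod PRIME) mod width
        self.params = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        a, b = self.params[row]
        return ((a * hash(item) + b) % self.PRIME) % self.width

    def update(self, item, count=1):
        # Every update touches one counter in each row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # The minimum over rows is the least-contaminated counter.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))


def update_all(sketch, stream):
    """Baseline: examine every item in the stream."""
    for item in stream:
        sketch.update(item)


def update_one_in_n(sketch, stream, n):
    """Naive sampling: examine every n-th item and scale its weight by n."""
    for i, item in enumerate(stream):
        if i % n == 0:
            sketch.update(item, count=n)
```

With full updates, a Count-Min estimate never falls below the true count. With 1-in-N sampling, the per-item work drops by roughly a factor of N, but an item's estimate can now land above or below its true count, which is the loss of guarantees the abstract refers to.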





Publication date: 2006